[client, relay] Advertise relay server IP via signal for foreign-relay fallback dial#6004
[client, relay] Advertise relay server IP via signal for foreign-relay fallback dial#6004
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe changes implement relay server IP support throughout the Netbird client and relay infrastructure. The client now parses relay server IPs from signal messages, stores them in offer/answer structures, and forwards them to the relay manager. The relay manager and client support IP-based fallback dialing when FQDN resolution fails. Dialers are extended with SNI/ServerName parameters for TLS certificate verification. The signal protocol adds an optional relayServerIP field. Changes
Sequence DiagramsequenceDiagram
participant Signal as Signal Server
participant Client as Client Engine
participant Handshaker as Handshaker
participant Manager as Relay Manager
participant RelayClient as Relay Client
participant Dialer as QUIC/WS Dialer
Signal->>Client: Signal Message (relayServerIP bytes)
Client->>Client: convertToOfferAnswer()
Client->>Client: decodeRelayIP() → netip.Addr
Client->>Handshaker: buildOfferAnswer()
Handshaker->>Handshaker: OfferAnswer.RelaySrvIP = IP
Handshaker->>Client: return OfferAnswer with IP
Client->>Client: signalOfferAnswer()
Client->>Signal: MarshalCredential(CredentialPayload with RelaySrvIP)
Note over Client,Dialer: Peer receives offer with relay IP
Client->>Manager: OpenConn(serverAddress, peerKey, serverIP)
Manager->>RelayClient: NewClientWithServerIP(serverURL, serverIP)
RelayClient->>RelayClient: connect()
RelayClient->>Dialer: dialDirect(relayURL with serverIP)
alt Direct IP dial fails
RelayClient->>Dialer: fallback to dialFQDN(relayURL)
Dialer->>Dialer: DNS resolution
end
Dialer->>Dialer: TLS with SNI=original hostname
Dialer->>RelayClient: return net.Conn
RelayClient->>RelayClient: ConnectedIP() → remote IP
RelayClient->>Manager: return connection
Manager->>Client: return connection ready
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
shared/signal/proto/signalexchange.proto (1)
68-76: LGTM — proper field deprecation and clear semantics.Reserving tag
9(previouslyint64 sessionId) is the correct way to prevent wire-incompatible reuse after the type change tooptional bytes sessionId = 10. The newbytes relayServerIP = 11documentation usefully calls out:
- 4-byte (IPv4) / 16-byte (IPv6) encoding (matches
netip.Addr.AsSlice()),- fallback-only usage,
- SNI/TLS verification still bound to
relayServerAddress.One consideration: also
reserved "sessionId"(the old name) alongside the tag, in case any tooling/codegen elsewhere keys off the legacy name. Optional.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/signal/proto/signalexchange.proto` around lines 68 - 76, Add a symbolic reservation for the old field name to avoid tooling/codegen reusing it: next to the existing numeric reservation "reserved 9;" add reserved "sessionId"; so that the proto contains both reserved 9; and reserved "sessionId"; while keeping the new optional bytes sessionId = 10 and bytes relayServerIP = 11 unchanged.shared/relay/client/dialer/race_dialer.go (1)
46-51: Minor: fluent setter is fine, but document concurrency expectations.
WithServerNamemutates the receiver in place. GivenRaceDialis constructed per-dial in practice, this is safe, but a one-liner clarifying thatRaceDialis not safe for concurrent reconfiguration would prevent future misuse if it's ever cached/shared.📝 Suggested doc tweak
-// WithServerName sets a TLS SNI/cert validation override. Used when serverURL -// contains an IP literal but the cert is issued for a different hostname. +// WithServerName sets a TLS SNI/cert validation override. Used when serverURL +// contains an IP literal but the cert is issued for a different hostname. +// Not safe to call concurrently with Dial; configure before invoking Dial. func (r *RaceDial) WithServerName(serverName string) *RaceDial {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/dialer/race_dialer.go` around lines 46 - 51, The WithServerName method mutates the RaceDial receiver in place and may be misused concurrently; update the doc comment for func (r *RaceDial) WithServerName(serverName string) *RaceDial to state that it mutates the receiver, is intended to be used on a per-dial RaceDial instance, and is not safe for concurrent reconfiguration (recommend creating a new RaceDial or copying before reuse). Also mention that RaceDial is typically constructed per-dial to make the concurrency expectation explicit.shared/relay/client/dialer/race_dialer_test.go (1)
31-33: LGTM!The
_blank identifier on the newserverNameparameter is the right choice here since the mock'sdialFuncfield signature is intentionally not extended — existing tests don't exercise SNI override behavior, and the mock continues to forward onlyctx/address.Worth considering as a follow-up: a dedicated test that asserts
RaceDial.WithServerName(...)propagates the override intoDial, since the mock currently ignores it. Optional, not blocking.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/dialer/race_dialer_test.go` around lines 31 - 33, The blank identifier for the new serverName parameter in MockDialer.Dial is fine for now, but to test SNI override behavior add a follow-up unit test: extend the mock to capture the serverName (e.g. add or change MockDialer.dialFunc to accept the serverName or add a separate hook), update MockDialer.Dial to forward the serverName to that hook, then add a test calling RaceDial.WithServerName(...) and assert the mock received the expected serverName in Dial; reference MockDialer.Dial, MockDialer.dialFunc, RaceDial.WithServerName and Dial when making these changes.shared/relay/client/dialer/quic/quic.go (1)
26-45: LGTM — SNI override semantics are correct.Explicit
serverNamecorrectly takes precedence over URL-derived SNI, and the default branch preserves the prior IP-literal-skip behavior (avoids invalid SNI for[::1]:443/1.2.3.4:443).One small consideration: if a caller passes
serverNamethat itself is an IP literal,tls.Config.ServerNamewill be set to an IP. Go's crypto/tls omits the SNI extension in this case while still using the IP for certificate verification. Given the documented contract onRaceDial.WithServerName("override when serverURL contains an IP literal but cert is issued for a different hostname"), callers are expected to pass an FQDN. An optional sanity check (e.g., skip override ifnet.ParseIP(serverName) != nil) would harden against misuse, but the code is sound as-is.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/dialer/quic/quic.go` around lines 26 - 45, The Dial function currently unconditionally sets tlsClientConfig.ServerName when serverName is provided; add a sanity check so we only assign ServerName if serverName is not an IP literal (use net.ParseIP(serverName) == nil) to avoid forcing an IP into SNI (which crypto/tls omits); update the switch/case handling in Dial (look for Dial, serverName, tlsClientConfig.ServerName) to skip the override when serverName parses as an IP and leave the URL-derived behavior intact.shared/relay/client/client.go (1)
380-386: Consider preserving the fallback dial error for diagnostics.When the primary dial fails and
dialFallbackalso fails, only the primary error is propagated andfbErris silently discarded. BecausedialFallbackdistinguishes useful operator-facing cases ("no fallback IP configured",substituteHostparse errors, actual fallback dial failure), losing it makes diagnosing why the fallback didn't recover the connection harder in the field.♻️ Suggested change
rd := dialer.NewRaceDial(c.log, dialer.DefaultConnectionTimeout, c.connectionURL, dialers...) conn, err := rd.Dial(ctx) if err != nil { fallbackConn, fbErr := c.dialFallback(ctx, dialers) if fbErr != nil { - return nil, err + return nil, fmt.Errorf("primary dial: %w; fallback: %v", err, fbErr) } conn = fallbackConn }Or use
errors.Join(err, fbErr)if joining is preferred over wrapping.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/client.go` around lines 380 - 386, When primary dial failure falls back into c.dialFallback, preserve the fallback error instead of discarding fbErr; in the error path inside the block around c.dialFallback(ctx, dialers) return a combined error that includes both err and fbErr (e.g., using errors.Join(err, fbErr) or wrapping fbErr into err with fmt.Errorf) so callers see why the fallback also failed; update the branch that currently does "if fbErr != nil { return nil, err }" to return a composed error referencing err and fbErr while keeping the same return types and behavior when fallback succeeds and sets conn = fallbackConn.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/engine.go`:
- Around line 2405-2418: The dialing fallback currently uses c.fallbackIP when
only checked by IsValid(), which allows unspecified addresses like 0.0.0.0 or ::
(produced by decodeRelayIP returning the zero netip.Addr) to be used; update the
guard in dialFallback (referencing c.fallbackIP and the dialFallback function in
shared/relay/client/client.go) to require a non-unspecified address by changing
the condition to also check IsUnspecified (e.g., treat as invalid if
!c.fallbackIP.IsValid() || c.fallbackIP.IsUnspecified()) so unspecified zero
addresses are not used as fallback targets.
---
Nitpick comments:
In `@shared/relay/client/client.go`:
- Around line 380-386: When primary dial failure falls back into c.dialFallback,
preserve the fallback error instead of discarding fbErr; in the error path
inside the block around c.dialFallback(ctx, dialers) return a combined error
that includes both err and fbErr (e.g., using errors.Join(err, fbErr) or
wrapping fbErr into err with fmt.Errorf) so callers see why the fallback also
failed; update the branch that currently does "if fbErr != nil { return nil, err
}" to return a composed error referencing err and fbErr while keeping the same
return types and behavior when fallback succeeds and sets conn = fallbackConn.
In `@shared/relay/client/dialer/quic/quic.go`:
- Around line 26-45: The Dial function currently unconditionally sets
tlsClientConfig.ServerName when serverName is provided; add a sanity check so we
only assign ServerName if serverName is not an IP literal (use
net.ParseIP(serverName) == nil) to avoid forcing an IP into SNI (which
crypto/tls omits); update the switch/case handling in Dial (look for Dial,
serverName, tlsClientConfig.ServerName) to skip the override when serverName
parses as an IP and leave the URL-derived behavior intact.
In `@shared/relay/client/dialer/race_dialer_test.go`:
- Around line 31-33: The blank identifier for the new serverName parameter in
MockDialer.Dial is fine for now, but to test SNI override behavior add a
follow-up unit test: extend the mock to capture the serverName (e.g. add or
change MockDialer.dialFunc to accept the serverName or add a separate hook),
update MockDialer.Dial to forward the serverName to that hook, then add a test
calling RaceDial.WithServerName(...) and assert the mock received the expected
serverName in Dial; reference MockDialer.Dial, MockDialer.dialFunc,
RaceDial.WithServerName and Dial when making these changes.
In `@shared/relay/client/dialer/race_dialer.go`:
- Around line 46-51: The WithServerName method mutates the RaceDial receiver in
place and may be misused concurrently; update the doc comment for func (r
*RaceDial) WithServerName(serverName string) *RaceDial to state that it mutates
the receiver, is intended to be used on a per-dial RaceDial instance, and is not
safe for concurrent reconfiguration (recommend creating a new RaceDial or
copying before reuse). Also mention that RaceDial is typically constructed
per-dial to make the concurrency expectation explicit.
In `@shared/signal/proto/signalexchange.proto`:
- Around line 68-76: Add a symbolic reservation for the old field name to avoid
tooling/codegen reusing it: next to the existing numeric reservation "reserved
9;" add reserved "sessionId"; so that the proto contains both reserved 9; and
reserved "sessionId"; while keeping the new optional bytes sessionId = 10 and
bytes relayServerIP = 11 unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a2bfe7a9-1d40-4e3e-98b0-ffed084511ed
⛔ Files ignored due to path filters (1)
shared/signal/proto/signalexchange.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (22)
client/internal/engine.goclient/internal/peer/handshaker.goclient/internal/peer/signaler.goclient/internal/peer/worker_relay.gorelay/test/benchmark_test.gorelay/testec2/relay.goshared/relay/client/client.goshared/relay/client/client_fallback_test.goshared/relay/client/client_test.goshared/relay/client/dialer/quic/quic.goshared/relay/client/dialer/race_dialer.goshared/relay/client/dialer/race_dialer_test.goshared/relay/client/dialer/ws/conn.goshared/relay/client/dialer/ws/dialopts_generic.goshared/relay/client/dialer/ws/dialopts_js.goshared/relay/client/dialer/ws/ws.goshared/relay/client/manager.goshared/relay/client/manager_fallback_test.goshared/relay/client/manager_test.goshared/relay/client/picker.goshared/signal/client/client.goshared/signal/proto/signalexchange.proto
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
shared/relay/client/dialer/race_dialer_test.go (1)
26-33: Optional: consider plumbingserverNameintodialFuncfor future test coverage.The mock currently discards the new
serverNameparameter. That's fine for the existing race-coordination tests, but if/when behavior depends onserverName(TLS SNI override is now part of the dialer contract), tests won't be able to assert it without widening thedialFuncsignature. Not a blocker for this PR.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/dialer/race_dialer_test.go` around lines 26 - 33, The MockDialer currently ignores the new serverName parameter; update MockDialer.dialFunc signature to func(ctx context.Context, address, serverName string) (net.Conn, error), change the MockDialer.Dial method to accept the serverName param and pass it through to dialFunc (i.e., return m.dialFunc(ctx, address, serverName)), and update any test instantiations of MockDialer to supply a matching dialFunc so tests can assert on serverName (leave protocolStr unchanged).shared/relay/client/dialer/ws/ws.go (1)
88-91: Set an explicitMinVersionon the WS TLS config.Static analysis flagged this block. Since you're already amending
tls.Confighere to setServerName, it's a low-cost moment to also pin the minimum TLS version (TLS 1.2 or 1.3) and avoid relying on Go's default, which has shifted between releases.♻️ Proposed change
TLSClientConfig: &tls.Config{ RootCAs: certPool, ServerName: serverName, + MinVersion: tls.VersionTLS12, },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/dialer/ws/ws.go` around lines 88 - 91, The TLS config in the WS dialer currently sets RootCAs and ServerName but doesn't pin a minimum TLS version; update the tls.Config used for TLSClientConfig (in ws.go where TLSClientConfig is constructed) to include MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 if you prefer stricter policy) so the client refuses older protocol versions; ensure you reference the tls.Config used in the WS dialer initialization (the TLSClientConfig field) when adding this option.shared/relay/client/client.go (1)
421-447: Minor: SNI with IP-literalserverNamewon't help cert validation.When
serverURLisrels://[2001:db8::5]:443(IPv6 literal),substituteHostreturnsserverName == "2001:db8::5", whichWithServerNamewill set astls.Config.ServerName. Per RFC 6066 SNI must be a DNS name; Go's TLS stack will then verify the peer cert against that IP viax509.Certificate.VerifyHostname, which generally requires an IP SAN. In practice this fallback path is only exercised when the primary FQDN dial fails, so an IP-literalserverURLshouldn't reach here — but consider rejecting an IP-literalorigHostearly (or skippingWithServerNamein that case) to avoid silently producing an unverifiable TLS config.Not blocking; flagging since the unit test pins this behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/client.go` around lines 421 - 447, substituteHost currently returns origHost even when it's an IP literal, which leads callers to set tls.Config.ServerName to an IP (invalid for SNI). Change substituteHost to detect and treat IP-literal original hosts specially: parse origHost (e.g., via netip.ParseAddr or net.ParseIP) and if it is an IP literal return an empty origHost (or otherwise signal “no server name”) so callers will skip WithServerName; otherwise return the hostname as before. Update substituteHost to document that an empty origHost means “do not set TLS ServerName.”
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@shared/relay/client/client_fallback_test.go`:
- Line 24: Replace the hardcoded listenAddr ("127.0.0.1:50301") with a dynamic
bind ("127.0.0.1:0") so the OS chooses an available port; after creating the
listener, read the real address via listener.Addr().String() (or extract
host/port) and use that value when building the server URL/ExposedAddress and
any client dial targets (references: listenAddr variable, ExposedAddress, and
the test setup in client_fallback_test.go), ensuring the test no longer assumes
a fixed port.
---
Nitpick comments:
In `@shared/relay/client/client.go`:
- Around line 421-447: substituteHost currently returns origHost even when it's
an IP literal, which leads callers to set tls.Config.ServerName to an IP
(invalid for SNI). Change substituteHost to detect and treat IP-literal original
hosts specially: parse origHost (e.g., via netip.ParseAddr or net.ParseIP) and
if it is an IP literal return an empty origHost (or otherwise signal “no server
name”) so callers will skip WithServerName; otherwise return the hostname as
before. Update substituteHost to document that an empty origHost means “do not
set TLS ServerName.”
In `@shared/relay/client/dialer/race_dialer_test.go`:
- Around line 26-33: The MockDialer currently ignores the new serverName
parameter; update MockDialer.dialFunc signature to func(ctx context.Context,
address, serverName string) (net.Conn, error), change the MockDialer.Dial method
to accept the serverName param and pass it through to dialFunc (i.e., return
m.dialFunc(ctx, address, serverName)), and update any test instantiations of
MockDialer to supply a matching dialFunc so tests can assert on serverName
(leave protocolStr unchanged).
In `@shared/relay/client/dialer/ws/ws.go`:
- Around line 88-91: The TLS config in the WS dialer currently sets RootCAs and
ServerName but doesn't pin a minimum TLS version; update the tls.Config used for
TLSClientConfig (in ws.go where TLSClientConfig is constructed) to include
MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 if you prefer stricter policy)
so the client refuses older protocol versions; ensure you reference the
tls.Config used in the WS dialer initialization (the TLSClientConfig field) when
adding this option.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 72051d79-50ab-46a4-86d4-fa15b6865bbe
⛔ Files ignored due to path filters (1)
shared/signal/proto/signalexchange.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (22)
client/internal/engine.goclient/internal/peer/handshaker.goclient/internal/peer/signaler.goclient/internal/peer/worker_relay.gorelay/test/benchmark_test.gorelay/testec2/relay.goshared/relay/client/client.goshared/relay/client/client_fallback_test.goshared/relay/client/client_test.goshared/relay/client/dialer/quic/quic.goshared/relay/client/dialer/race_dialer.goshared/relay/client/dialer/race_dialer_test.goshared/relay/client/dialer/ws/conn.goshared/relay/client/dialer/ws/dialopts_generic.goshared/relay/client/dialer/ws/dialopts_js.goshared/relay/client/dialer/ws/ws.goshared/relay/client/manager.goshared/relay/client/manager_fallback_test.goshared/relay/client/manager_test.goshared/relay/client/picker.goshared/signal/client/client.goshared/signal/proto/signalexchange.proto
✅ Files skipped from review due to trivial changes (3)
- shared/signal/proto/signalexchange.proto
- client/internal/peer/worker_relay.go
- shared/relay/client/client_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- client/internal/peer/signaler.go
- shared/relay/client/dialer/ws/conn.go
- client/internal/engine.go
- shared/signal/client/client.go
- shared/relay/client/dialer/ws/dialopts_generic.go
- shared/relay/client/manager.go
There was a problem hiding this comment.
🧹 Nitpick comments (4)
shared/relay/client/dialer/ws/ws.go (1)
87-92: Consider setting an explicit TLSMinVersion.The relay server does not explicitly set
MinVersionin its TLS configuration, relying on Go's default minimum (TLS 1.2). For consistency and to make security requirements explicit, pin the client totls.VersionTLS12in this configuration. This prevents reliance on Go's defaults which could theoretically change in future versions, and ensures compatibility with relay servers that don't enforce TLS 1.3.♻️ Suggested change
TLSConfig: &tls.Config{ RootCAs: certPool, ServerName: serverName, + MinVersion: tls.VersionTLS12, },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/dialer/ws/ws.go` around lines 87 - 92, The TLS client configuration currently sets RootCAs and ServerName but omits an explicit MinVersion; update the TLSClientConfig (the struct that uses certPool and serverName) to include MinVersion: tls.VersionTLS12 so the client pins the minimum TLS version to TLS 1.2 (use the tls package constant tls.VersionTLS12 in the TLSClientConfig initialization).shared/relay/client/manager_fallback_test.go (2)
96-145: LGTM on the round-trip flow.Subtest correctly drives the foreign-relay path: Bob registers on his relay, Alice dials the broken FQDN with Bob's advertised IP and the payload echoes back. The
bobSideChgoroutine is safely bounded byctx(cancelled int.CleanupviamCancel), so at.Fatalfon the Alice side won't permanently leak the reader.One small nit: the
time.After(5*time.Second)timeout here can outlive the parentctxdeadline (20s less time already spent). That's fine in practice but consider<-ctx.Done()instead so any future tightening of the parent timeout propagates.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/manager_fallback_test.go` around lines 96 - 145, The test "fallback IP recovers" uses time.After(5 * time.Second) which can outlive the parent test context; change the select waiting for bobSideCh to instead listen on <-ctx.Done() so the parent context deadline/cancellation propagates (look for the select that reads from bobSideCh in the subtest and replace the time.After branch with a ctx.Done() branch that fails the test, e.g., t.Fatalf with ctx.Err()).
23-23: Use ephemeral ports to prevent flakiness from port collisions.The hardcoded ports
52401and52402can cause test failures if another process binds them or if the test runs in parallel with other instances. The same package already provides afreeAddr()helper inclient_fallback_test.gothat obtains an ephemeral port vianet.Listen("tcp", "127.0.0.1:0"). Use this pattern to dynamically allocate ports and read the bound addresses from the servers, avoiding TIME_WAIT collisions and making the test safely repeatable.Also applies to: 40-40, 82-82
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/relay/client/manager_fallback_test.go` at line 23, Replace the hardcoded ports in ListenerConfig instances with addresses obtained from the existing freeAddr() helper used in client_fallback_test.go: call freeAddr() to get an ephemeral address, assign it to the Address field of server.ListenerConfig (e.g., where homeCfg := server.ListenerConfig{...} and other configs are created), and use the bound address returned by the listeners when starting servers so tests use dynamic ephemeral ports instead of "127.0.0.1:52401"/"52402"; ensure all three occurrences (the homeCfg creation and the two other ListenerConfig usages) are updated to use freeAddr() and the resulting address variables.shared/signal/client/client.go (1)
74-96: LGTM — payload struct + relay IP encoding.Gating on
IsValid()and usingUnmap().AsSlice()is the right choice: an IPv4-mapped IPv6 (e.g.::ffff:1.2.3.4) is normalized to a 4-byte slice. The decoder (decodeRelayIPinclient/internal/engine.go) already handles both 4- and 16-byte inputs correctly vianetip.AddrFromSlice, which the function's docstring explicitly documents.Optional nit:
p.Credentialis dereferenced unconditionally on line 84; if a future caller passes a zeroCredentialPayload{}this will panic. A nil-guard returning a clear error would harden the API without changing happy-path behavior.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shared/signal/client/client.go` around lines 74 - 96, MarshalCredential dereferences p.Credential without checking for nil which can panic if a caller passes a zero CredentialPayload; add a nil-guard at the top of MarshalCredential (the function in shared/signal/client/client.go) that checks if p.Credential == nil and returns nil plus a clear error (e.g., fmt.Errorf("missing credential in CredentialPayload")) so callers get a descriptive error instead of a panic, keeping the current return signature (*proto.Message, error).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@shared/relay/client/dialer/ws/ws.go`:
- Around line 87-92: The TLS client configuration currently sets RootCAs and
ServerName but omits an explicit MinVersion; update the TLSClientConfig (the
struct that uses certPool and serverName) to include MinVersion:
tls.VersionTLS12 so the client pins the minimum TLS version to TLS 1.2 (use the
tls package constant tls.VersionTLS12 in the TLSClientConfig initialization).
In `@shared/relay/client/manager_fallback_test.go`:
- Around line 96-145: The test "fallback IP recovers" uses time.After(5 *
time.Second) which can outlive the parent test context; change the select
waiting for bobSideCh to instead listen on <-ctx.Done() so the parent context
deadline/cancellation propagates (look for the select that reads from bobSideCh
in the subtest and replace the time.After branch with a ctx.Done() branch that
fails the test, e.g., t.Fatalf with ctx.Err()).
- Line 23: Replace the hardcoded ports in ListenerConfig instances with
addresses obtained from the existing freeAddr() helper used in
client_fallback_test.go: call freeAddr() to get an ephemeral address, assign it
to the Address field of server.ListenerConfig (e.g., where homeCfg :=
server.ListenerConfig{...} and other configs are created), and use the bound
address returned by the listeners when starting servers so tests use dynamic
ephemeral ports instead of "127.0.0.1:52401"/"52402"; ensure all three
occurrences (the homeCfg creation and the two other ListenerConfig usages) are
updated to use freeAddr() and the resulting address variables.
In `@shared/signal/client/client.go`:
- Around line 74-96: MarshalCredential dereferences p.Credential without
checking for nil which can panic if a caller passes a zero CredentialPayload;
add a nil-guard at the top of MarshalCredential (the function in
shared/signal/client/client.go) that checks if p.Credential == nil and returns
nil plus a clear error (e.g., fmt.Errorf("missing credential in
CredentialPayload")) so callers get a descriptive error instead of a panic,
keeping the current return signature (*proto.Message, error).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3c348c82-145c-468a-b35a-556c33dea40c
⛔ Files ignored due to path filters (1)
shared/signal/proto/signalexchange.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (22)
client/internal/engine.goclient/internal/peer/handshaker.goclient/internal/peer/signaler.goclient/internal/peer/worker_relay.gorelay/test/benchmark_test.gorelay/testec2/relay.goshared/relay/client/client.goshared/relay/client/client_fallback_test.goshared/relay/client/client_test.goshared/relay/client/dialer/quic/quic.goshared/relay/client/dialer/race_dialer.goshared/relay/client/dialer/race_dialer_test.goshared/relay/client/dialer/ws/conn.goshared/relay/client/dialer/ws/dialopts_generic.goshared/relay/client/dialer/ws/dialopts_js.goshared/relay/client/dialer/ws/ws.goshared/relay/client/manager.goshared/relay/client/manager_fallback_test.goshared/relay/client/manager_test.goshared/relay/client/picker.goshared/signal/client/client.goshared/signal/proto/signalexchange.proto
✅ Files skipped from review due to trivial changes (4)
- shared/relay/client/dialer/race_dialer_test.go
- shared/relay/client/dialer/ws/dialopts_generic.go
- shared/relay/client/dialer/quic/quic.go
- shared/relay/client/client_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- shared/relay/client/picker.go
- shared/signal/proto/signalexchange.proto
- relay/test/benchmark_test.go
- client/internal/peer/signaler.go
- shared/relay/client/client_fallback_test.go
- shared/relay/client/client.go
| // encoded as 4 bytes (IPv4) or 16 bytes (IPv6). Receivers may use it as a | ||
| // fallback dial target when DNS resolution of relayServerAddress fails. | ||
| // SNI/TLS verification still uses relayServerAddress. | ||
| bytes relayServerIP = 11; |
There was a problem hiding this comment.
Why not optional? (realayServerAddress also should be optional)
There was a problem hiding this comment.
Doesn't really make a difference for bytes, but I'll add it for consistency
| // RelayInstanceIP returns the IP address of the live home relay connection. | ||
| // Zero value if not connected. Sent alongside RelayInstanceAddress so remote | ||
| // peers can dial directly without their own DNS lookup. | ||
| func (m *Manager) RelayInstanceIP() netip.Addr { |
There was a problem hiding this comment.
Return the RelayInstanceAddress() along with this variable next to the domain. We can avoid a rare inconsistent state during a relay reconnection.
| func NewClient(serverURL string, authTokenStore *auth.TokenStore, peerID string, mtu uint16) *Client { | ||
| // is called. fallbackIP, when valid, is used as a dial-time fallback if the FQDN-based dial fails. TLS | ||
| // verification still uses the FQDN from serverURL via SNI. | ||
| func NewClient(serverURL string, fallbackIP netip.Addr, authTokenStore *auth.TokenStore, peerID string, mtu uint16) *Client { |
There was a problem hiding this comment.
Please create a new constructor with the extended parameter because the new parameter is not mandatory for all cases. NewClientForForeginServer or NewClientWithFallbackIp.
Mark relayServerAddress and relayServerIP as optional in the signal proto, return the relay instance address and IP atomically from Manager.RelayInstanceAddress to avoid divergence across reconnections, and split the relay client constructor into NewClient and NewClientWithServerIP. Rename fallback terminology to server IP throughout.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@shared/relay/client/client_serverip_test.go`:
- Around line 268-279: The helper freeAddr creates a bind-race by closing the
listener before the test server binds; change the approach so the listener is
not closed and the test server uses the existing listener. Modify freeAddr (or
replace it) to return the active net.Listener (and its port/address) instead of
closing it, and update callers/tests that currently call freeAddr to accept and
pass that listener into the server startup (or call server.Serve(listener)) so
the port remains reserved until the server owns it; ensure the function
signature and all references to freeAddr are updated accordingly.
In `@shared/relay/client/manager_serverip_test.go`:
- Around line 126-131: The test currently calls aliceConn.Read into buf and
assumes it fills the entire payload, which can flake on short reads; update the
read logic in manager_serverip_test.go (the aliceConn/read of buf for payload)
to either use io.ReadFull(aliceConn, buf) and handle its error, or capture the
returned n from aliceConn.Read and compare only buf[:n] to payload (or loop
until total bytes read equals len(payload)); ensure any error from io.ReadFull
or Read is properly checked and fails the test with a clear message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 89ed7ad4-34ba-40d0-889d-df262ce03700
⛔ Files ignored due to path filters (1)
shared/signal/proto/signalexchange.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (12)
client/internal/engine.goclient/internal/peer/handshaker.goclient/internal/peer/status.goclient/internal/peer/worker_relay.gorelay/test/benchmark_test.goshared/relay/client/client.goshared/relay/client/client_serverip_test.goshared/relay/client/manager.goshared/relay/client/manager_serverip_test.goshared/relay/client/manager_test.goshared/signal/client/client.goshared/signal/proto/signalexchange.proto
🚧 Files skipped from review as they are similar to previous changes (4)
- relay/test/benchmark_test.go
- shared/signal/proto/signalexchange.proto
- shared/signal/client/client.go
- shared/relay/client/manager_test.go
|



Describe your changes
Problem
When two peers are on different relays and P2P (ICE) cannot be established, each learns the other peer's relay address through signal as an FQDN. If that FQDN cannot be resolved on the receiving side, the foreign-relay dial fails and the peers never reach each other.
This deadlocks on first connect when the only configured exit-node peer is the one whose relay needs to be resolved: the overlay default route is installed before the exit-node WireGuard handshake completes, so DNS goes through
wt0and fails. Management's relay cache doesn't help here, the foreign peer's home relay is only learned through signal, not from management.Fix
Peer B already knows the IP it is connected to on its relay. We add that IP to the signal
Bodyso peer A can use it as a dial-time fallback when the FQDN cannot be resolved. SNI/cert validation still uses the FQDN, so TLS is unaffected.relayServerIP(bytes, tag 11) to signalBody; reserve tag 9 (previously int64 sessionId) to avoid wire-type collisionsConnectedIPreturns the resolved IP after an FQDN dialnetip.AddrthroughManager.OpenConnandClientwith a fallback dial path that substitutes the IP into the URL while keeping the original host as SNIConnectedIPafter an FQDN dial, andsubstituteHost; add an end-to-end manager test exercising the foreign-relay pathIssue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Internal protocol field; no user-visible behavior change beyond reachability of foreign relays when DNS for the FQDN is unavailable.
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
New Features
Tests